# Finetuning Qwen2.5-3B using DPO with Unsloth

## Introduction
Qwen2.5-3B is a pretrained language model with 3.09 billion parameters. It strikes a balance between expressiveness and computational cost, making it well suited to fine-tuning with specialized optimization strategies. In this guide, we use Direct Preference Optimization (DPO) within the Unsloth framework to fine-tune Qwen2.5-3B. This approach aligns the model's responses with specific preferences and behaviors, yielding a fine-tuned model that handles the target task more accurately.
## Fine-Tuning with DPO
Direct Preference Optimization (DPO) is a technique for fine-tuning language models in scenarios where it is critical to optimize toward preferred outputs, such as ranking candidate responses. Combining DPO with LoRA (Low-Rank Adaptation) makes fine-tuning efficient by updating only a small subset of the model's parameters. This keeps training costs low while maintaining flexibility in the model's outputs.
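Concretely, DPO trains on (prompt, chosen, rejected) triples by widening the gap between the policy's log-probabilities on the chosen and rejected responses, relative to a frozen reference model. Below is a minimal pure-Python sketch of the per-example DPO loss; the framework computes this internally, and the log-probability numbers here are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    # DPO loss: -log sigmoid(beta * margin), where the margin measures how
    # much more the policy prefers "chosen" over "rejected" than the frozen
    # reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(sigmoid(beta * margin))

# Policy favors the chosen response more than the reference does: low loss
low = dpo_loss(-10.0, -30.0, -20.0, -25.0)
# Policy favors the rejected response: high loss
high = dpo_loss(-30.0, -10.0, -25.0, -20.0)
```

The `beta` hyperparameter (set to 0.5 in the training configuration later in this guide) scales how sharply the loss penalizes disagreement with the preference data.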
## Use Case
For this example, our use case involves using DPO to fine-tune Qwen2.5-3B to generate engaging Tiny Stories tailored for children. By providing prompts with desired (chosen) responses and contrasting rejected responses, we can shape the model to deliver coherent, engaging, and task-appropriate stories across a variety of instructions.
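For illustration, each training example in such a preference dataset is a triple: a prompt, a preferred (chosen) completion, and a dispreferred (rejected) one. The contents below are invented to show the shape; real dataset entries differ:

```python
# One hypothetical preference triple for story generation (contents invented)
example = {
    "prompt": "Write a short story about a kind fox.",
    "chosen": "Once upon a time, a kind fox named Fred shared his berries with a hungry mouse...",
    "rejected": "Foxes are omnivorous mammals belonging to the family Canidae.",
}

# DPO expects exactly these three fields per example
fields = set(example)
```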
## Implementation
### Step 1: Import Necessary Libraries

```python
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

import os
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import DPOConfig, DPOTrainer
from google.colab import userdata
```

### Step 2: Initialize Comet ML for Experiment Tracking
```python
import comet_ml

comet_ml.login(project_name="dpo-lora-unsloth")
```

### Step 3: Load Pretrained Model and Tokenizer
```python
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
)
```

### Step 4: Apply LoRA Adaptation
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
)
```

### Step 5: Dataset Preparation
Format the dataset using the Alpaca-style template below and split it into training and test sets. The preference dataset (with `prompt`, `chosen`, and `rejected` columns) must be loaded first; the dataset identifier below is a placeholder to replace with your own.

```python
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""

EOS_TOKEN = tokenizer.eos_token

def format_samples(example):
    example["prompt"] = alpaca_template.format(example["prompt"])
    example["chosen"] = example["chosen"] + EOS_TOKEN
    example["rejected"] = example["rejected"] + EOS_TOKEN
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

# Load your preference dataset (placeholder identifier; replace with yours)
dataset = load_dataset("<your-preference-dataset>", split="train")
dataset = dataset.map(format_samples)
dataset = dataset.train_test_split(test_size=0.05)
```

### Step 6: Training Using DPOTrainer
Configure and train the model using the DPOTrainer class.
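Note that on a single GPU the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps`. A quick sanity check of the resulting number of optimizer steps, assuming a hypothetical 4,000 training pairs (the real count depends on your dataset):

```python
# Values from the training configuration below
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 1
n_train_examples = 4_000  # hypothetical; depends on your dataset

# Gradients from 8 micro-batches of 2 are accumulated before each update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
optimizer_steps = (n_train_examples // effective_batch_size) * num_train_epochs

print(effective_batch_size, optimizer_steps)  # 16 250
```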
```python
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    tokenizer=tokenizer,
    beta=0.5,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_length=max_seq_length // 2,
    max_prompt_length=max_seq_length // 2,
    args=DPOConfig(
        learning_rate=2e-6,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        eval_strategy="steps",
        eval_steps=0.2,
        logging_steps=1,
        report_to="comet_ml",
        seed=0,
    ),
)

trainer.train()
```

### Step 7: Model Inference
Generate a response using the fine-tuned model.

```python
FastLanguageModel.for_inference(model)

message = alpaca_template.format(
    "Write a story about a humble little bunny named Ben who follows "
    "a mysterious trail in the woods, discovering beautiful flowers, "
    "new friends, and a lovely pond along the way."
)
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
```

### Step 8: Save and Push to Hugging Face Hub
```python
from huggingface_hub import login

# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-DPO-TinyStories", tokenizer, save_method="merged_16bit")
```

## Inference
Using the fine-tuned model from the Hub to generate outputs. Because the model is loaded through plain `transformers` here, no Unsloth call is needed:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("tanquangduong/Qwen2.5-3B-DPO-TinyStories")
model = AutoModelForCausalLM.from_pretrained("tanquangduong/Qwen2.5-3B-DPO-TinyStories")
model = model.to("cuda")

alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""

message = alpaca_template.format(
    "Write a story about a humble little bunny named Ben who follows "
    "a mysterious trail in the woods, discovering beautiful flowers, "
    "new friends, and a lovely pond along the way.",
    "",
)
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
```

## Conclusion
In this guide, we demonstrated how to fine-tune Qwen2.5-3B using Direct Preference Optimization (DPO) within the Unsloth framework. By leveraging LoRA for parameter-efficient adaptation, we tailored the model’s output behavior to better suit our target use case of generating child-friendly Tiny Stories. This methodology highlights the effectiveness of combining DPO and LoRA to achieve powerful, specialized fine-tuned models.